Midterm Project: Effects of Air Pollution on Countries

1. Introduction

Motivation

Air pollution affects blah blah blah… and with the increased worsening in climate and air quality, blah blah blah…. Our group wanted to explore how air pollution has changed over time and affect countries differently. Specifically, we wanted to analyze how a country’s economic and social position can either increase, decrease, or not have observable impact on the affects of air pollution. In laymen terms, does air pollution affect underdeveloped countries disproportionately?

Set Up

Before we start, we need to ensure that we have all the relevant libraries installed and imported.

Run these in the console to install packages in addition to the ezids package.

install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")

2. Data Sources and Data Wrangling

Data Sources

For our analysis, we will be working with 5 main data sources shown in the table below:

Figure 1: Data Sources
Data Source Link
Deaths Due to Air Pollution of Countries from 1990 - 2017 Kaggle Link
GDP Annual Growth of Countries from 1960 - 2020 Kaggle via WorldBank Link
United Nations Population and Region Data United Nations Link
United Nations ISO-alpha3 code United Nations Link
spData for Map Geometries spData for Mapping Link

The main variables in our datasets will include:

Figure 2: Key Variables
Feature Data Type Unit of Measure Notes and Assumptions
GDP (Gross Domestic Product) Numerical, Continuous $USD This is our chosen proxy for measuring a country’s economic status
Population Size Numerical, Continuous thousands of people Annual UN estimated
Deaths due to Air Pollution Numerical, Continuous deaths per million This is our chosen proxy for measuring the negative affects of air pollution.
Country Qualitative, Categorical N/A 231 countries
SDG Region Qualitative, Categorical N/A UN’s Sustainable Development Goals Region Classification.
Sub Region Qualitative, Categorical N/A UN’s Sustainable Development Goals Sub-Region Classification.
ISO-alpha3 Country Code Qualitative, Categorical N/A Standard for identifying countries (text ID).
ISO-alpha2 Country Code Qualitative, Categorical N/A Another standard for identifying countries (text ID).
M49 Country Code Numerical, Categorical N/A Another standard for identifying countries (numerical ID).
Year Numerical, Categorical N/A 1990 to 2017
GDP per Capita Numerical, Continuous $USD per person Normalization of GDP to compare between population sizes (calculated).

Data Wrangling

While data from Kaggle are already in a format to be cleaned, downloaded data from United Nations required a little data wrangling. Mainly, we needed to extract just countries’ data from the Excel workbooks and into their own contained csv files. Since we only need to do this once and programming it would take significant time to choose the specific cells that we need, we opted to perform this step outside of R and in Excel. Note that if this were a part of a real production data pipeline, we would take the time to program the data extraction but would likely choose a different programming language such as Python that is a bit more robust in these types of tasks like web scraping and data transformations in Pandas.

UN Data Sample Messy
Figure 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
UN Data Sample Cleaned
Figure 4: Sample screenshot of transformed UN dataset.

3. Load, Clean, and Inspect Data

Load Data

country_codes_df <- read.csv("data/country_codes.csv", header = TRUE, sep = ",")
air_pollution_df <- read.csv("data/death-rates-from-air-pollution.csv", header = TRUE, sep = ",")
gdp_df <- read.csv("data/GDP_annual_growth.csv", header = TRUE, sep = ",")
population_region_df <- read.csv("data/population_in_thousands_region.csv", header = TRUE, sep = ",")
str(country_codes_df)
## 'data.frame':    249 obs. of  4 variables:
##  $ Country.or.Area: chr  "Andorra" "United Arab Emirates (the)" "Afghanistan" "Antigua and Barbuda" ...
##  $ ISO.alpha2.code: chr  "AD" "AE" "AF" "AG" ...
##  $ ISO.alpha3.code: chr  "AND" "ARE" "AFG" "ATG" ...
##  $ M49.code       : int  20 784 4 28 660 8 51 24 10 32 ...
str(air_pollution_df)
## 'data.frame':    6468 obs. of  7 variables:
##  $ Entity                                         : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Code                                           : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ Year                                           : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Air.pollution..total...deaths.per.100.000.     : num  299 291 279 279 287 ...
##  $ Indoor.air.pollution..deaths.per.100.000.      : num  250 243 232 232 239 ...
##  $ Outdoor.particulate.matter..deaths.per.100.000.: num  46.4 46 44.2 44.4 45.6 ...
##  $ Outdoor.ozone.pollution..deaths.per.100.000.   : num  5.62 5.6 5.61 5.66 5.72 ...
str(gdp_df)
## 'data.frame':    264 obs. of  66 variables:
##  $ Country.Name  : chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country.Code  : chr  "ABW" "AFG" "AGO" "ALB" ...
##  $ Indicator.Name: chr  "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...
##  $ Indicator.Code: chr  "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" ...
##  $ X1960         : num  NA 537777811 NA NA NA ...
##  $ X1961         : num  NA 548888896 NA NA NA ...
##  $ X1962         : num  NA 546666678 NA NA NA ...
##  $ X1963         : num  NA 751111191 NA NA NA ...
##  $ X1964         : num  NA 800000044 NA NA NA ...
##  $ X1965         : num  NA 1006666638 NA NA NA ...
##  $ X1966         : num  NA 1399999967 NA NA NA ...
##  $ X1967         : num  NA 1673333418 NA NA NA ...
##  $ X1968         : num  NA 1373333367 NA NA NA ...
##  $ X1969         : num  NA 1408888922 NA NA NA ...
##  $ X1970         : num  NA 1748886596 NA NA 78619206 ...
##  $ X1971         : num  NA 1831108971 NA NA 89409820 ...
##  $ X1972         : num  NA 1595555476 NA NA 113408232 ...
##  $ X1973         : num  NA 1733333264 NA NA 150820103 ...
##  $ X1974         : num  NA 2155555498 NA NA 186558696 ...
##  $ X1975         : num  NA 2366666616 NA NA 220127246 ...
##  $ X1976         : num  NA 2555555567 NA NA 227281025 ...
##  $ X1977         : num  NA 2953333418 NA NA 254020153 ...
##  $ X1978         : num  NA 3300000109 NA NA 308008898 ...
##  $ X1979         : num  NA 3697940410 NA NA 411578334 ...
##  $ X1980         : num  NA 3641723322 5930503401 NA 446416106 ...
##  $ X1981         : num  NA 3478787909 5550483036 NA 388958731 ...
##  $ X1982         : num  NA NA 5550483036 NA 375895956 ...
##  $ X1983         : num  NA NA 5784341596 NA 327861833 ...
##  $ X1984         : num  NA NA 6131475065 1857338012 330070689 ...
##  $ X1985         : num  NA NA 7553560459 1897050133 346737965 ...
##  $ X1986         : num  405463417 NA 7072063345 2097326250 482000594 ...
##  $ X1987         : num  487602458 NA 8083872012 2080796250 611316399 ...
##  $ X1988         : num  596423607 NA 8769250550 2051236250 721425939 ...
##  $ X1989         : num  695304363 NA 10201099040 2253090000 795449332 ...
##  $ X1990         : num  764887117 NA 11228764963 2028553750 1029048482 ...
##  $ X1991         : num  872138715 NA 10603784541 1099559028 1106928583 ...
##  $ X1992         : num  958463184 NA 8307810974 652174991 1210013652 ...
##  $ X1993         : num  1082979721 NA 5768720422 1185315468 1007025755 ...
##  $ X1994         : num  1245688268 NA 4438321017 1880951520 1017549124 ...
##  $ X1995         : num  1320474860 NA 5538749260 2392764853 1178738991 ...
##  $ X1996         : num  1379960894 NA 7526446606 3199642580 1223945357 ...
##  $ X1997         : num  1531944134 NA 7648377413 2258515610 1180597273 ...
##  $ X1998         : num  1665100559 NA 6506229607 2545967253 1211932398 ...
##  $ X1999         : num  1722798883 NA 6152922943 3212119044 1239876305 ...
##  $ X2000         : num  1873452514 NA 9129594819 3480355189 1429049198 ...
##  $ X2001         : num  1920111732 NA 8936063723 3922099471 1546926174 ...
##  $ X2002         : num  1941340782 4055179566 15285594828 4348070165 1755910032 ...
##  $ X2003         : num  2021229050 4515558808 17812705294 5611492283 2361726862 ...
##  $ X2004         : num  2228491620 5226778809 23552052408 7184681399 2894921778 ...
##  $ X2005         : num  2330726257 6209137625 36970918699 8052075642 3159905484 ...
##  $ X2006         : num  2424581006 6971285595 52381006892 8896073938 3456442103 ...
##  $ X2007         : num  2615083799 9747879532 65266452081 10677321490 3952600602 ...
##  $ X2008         : num  2745251397 10109225814 88538611205 12881354104 4085630584 ...
##  $ X2009         : num  2498882682 12439087077 70307163678 12044223353 3674409558 ...
##  $ X2010         : num  2390502793 15856574731 83799496611 11926928506 3449966857 ...
##  $ X2011         : num  2549720670 17804292964 111789686464 12890765324 3629203786 ...
##  $ X2012         : num  2534636872 20001598506 128052853643 12319830252 3188808943 ...
##  $ X2013         : num  2701675978 20561069558 136709862831 12776217195 3193704343 ...
##  $ X2014         : num  2765363128 20484885120 145712200313 13228144008 3271808157 ...
##  $ X2015         : num  2919553073 19907111419 116193649124 11386846319 2789870188 ...
##  $ X2016         : num  2965921788 18017749074 101123851090 11861200797 2896679212 ...
##  $ X2017         : num  3056424581 18869945678 122123822334 13019693451 3000180750 ...
##  $ X2018         : num  NA 18353881130 101353230785 15147020535 3218316013 ...
##  $ X2019         : num  NA 19291104008 88815697793 15279183290 3154057987 ...
##  $ X2020         : logi  NA NA NA NA NA NA ...
##  $ X             : logi  NA NA NA NA NA NA ...
str(population_region_df)
## 'data.frame':    235 obs. of  78 variables:
##  $ SDGRegion   : chr  "SUB-SAHARAN AFRICA" "SUB-SAHARAN AFRICA" "SUB-SAHARAN AFRICA" "SUB-SAHARAN AFRICA" ...
##  $ SubRegion   : chr  "Eastern Africa" "Eastern Africa" "Eastern Africa" "Eastern Africa" ...
##  $ Country     : chr  "Burundi" "Comoros" "Djibouti" "Eritrea" ...
##  $ Notes       : int  NA NA NA NA NA NA NA NA 1 2 ...
##  $ Country.code: int  108 174 262 232 231 404 450 454 480 175 ...
##  $ Type        : chr  "Country/Area" "Country/Area" "Country/Area" "Country/Area" ...
##  $ Parent.code : int  910 910 910 910 910 910 910 910 910 910 ...
##  $ X1950       : chr  "  2 309" "   159" "   62" "   822" ...
##  $ X1951       : chr  "  2 360" "   163" "   63" "   835" ...
##  $ X1952       : chr  "  2 406" "   167" "   65" "   849" ...
##  $ X1953       : chr  "  2 449" "   170" "   66" "   865" ...
##  $ X1954       : chr  "  2 492" "   173" "   68" "   882" ...
##  $ X1955       : chr  "  2 537" "   176" "   70" "   900" ...
##  $ X1956       : chr  "  2 585" "   179" "   71" "   919" ...
##  $ X1957       : chr  "  2 636" "   182" "   74" "   939" ...
##  $ X1958       : chr  "  2 689" "   185" "   76" "   961" ...
##  $ X1959       : chr  "  2 743" "   188" "   80" "   983" ...
##  $ X1960       : chr  "  2 798" "   191" "   84" "  1 008" ...
##  $ X1961       : chr  "  2 852" "   194" "   89" "  1 033" ...
##  $ X1962       : chr  "  2 907" "   197" "   94" "  1 060" ...
##  $ X1963       : chr  "  2 964" "   200" "   101" "  1 089" ...
##  $ X1964       : chr  "  3 026" "   204" "   108" "  1 118" ...
##  $ X1965       : chr  "  3 094" "   207" "   115" "  1 148" ...
##  $ X1966       : chr  "  3 170" "   211" "   123" "  1 179" ...
##  $ X1967       : chr  "  3 253" "   216" "   131" "  1 210" ...
##  $ X1968       : chr  "  3 337" "   221" "   140" "  1 243" ...
##  $ X1969       : chr  "  3 414" "   225" "   150" "  1 276" ...
##  $ X1970       : chr  "  3 479" "   230" "   160" "  1 311" ...
##  $ X1971       : chr  "  3 530" "   235" "   169" "  1 347" ...
##  $ X1972       : chr  "  3 570" "   239" "   179" "  1 385" ...
##  $ X1973       : chr  "  3 605" "   244" "   191" "  1 424" ...
##  $ X1974       : chr  "  3 646" "   250" "   205" "  1 464" ...
##  $ X1975       : chr  "  3 701" "   257" "   224" "  1 505" ...
##  $ X1976       : chr  "  3 771" "   266" "   249" "  1 548" ...
##  $ X1977       : chr  "  3 854" "   276" "   277" "  1 592" ...
##  $ X1978       : chr  "  3 949" "   287" "   308" "  1 637" ...
##  $ X1979       : chr  "  4 051" "   297" "   336" "  1 684" ...
##  $ X1980       : chr  "  4 157" "   308" "   359" "  1 733" ...
##  $ X1981       : chr  "  4 267" "   318" "   375" "  1 785" ...
##  $ X1982       : chr  "  4 380" "   327" "   385" "  1 837" ...
##  $ X1983       : chr  "  4 498" "   336" "   394" "  1 891" ...
##  $ X1984       : chr  "  4 621" "   345" "   406" "  1 946" ...
##  $ X1985       : chr  "  4 751" "   355" "   426" "  2 004" ...
##  $ X1986       : chr  "  4 887" "   366" "   454" "  2 065" ...
##  $ X1987       : chr  "  5 027" "   377" "   490" "  2 127" ...
##  $ X1988       : chr  "  5 169" "   388" "   529" "  2 186" ...
##  $ X1989       : chr  "  5 307" "   400" "   564" "  2 231" ...
##  $ X1990       : chr  "  5 439" "   412" "   590" "  2 259" ...
##  $ X1991       : chr  "  5 565" "   424" "   607" "  2 266" ...
##  $ X1992       : chr  "  5 686" "   436" "   615" "  2 258" ...
##  $ X1993       : chr  "  5 798" "   449" "   619" "  2 239" ...
##  $ X1994       : chr  "  5 899" "   462" "   622" "  2 218" ...
##  $ X1995       : chr  "  5 987" "   475" "   630" "  2 204" ...
##  $ X1996       : chr  "  6 060" "   489" "   644" "  2 196" ...
##  $ X1997       : chr  "  6 122" "   502" "   661" "  2 195" ...
##  $ X1998       : chr  "  6 186" "   515" "   680" "  2 206" ...
##  $ X1999       : chr  "  6 267" "   529" "   700" "  2 237" ...
##  $ X2000       : chr  "  6 379" "   542" "   718" "  2 292" ...
##  $ X2001       : chr  "  6 526" "   556" "   733" "  2 375" ...
##  $ X2002       : chr  "  6 704" "   569" "   747" "  2 481" ...
##  $ X2003       : chr  "  6 909" "   583" "   760" "  2 601" ...
##  $ X2004       : chr  "  7 132" "   597" "   772" "  2 720" ...
##  $ X2005       : chr  "  7 365" "   612" "   783" "  2 827" ...
##  $ X2006       : chr  "  7 608" "   626" "   795" "  2 918" ...
##  $ X2007       : chr  "  7 862" "   642" "   805" "  2 997" ...
##  $ X2008       : chr  "  8 126" "   657" "   816" "  3 063" ...
##  $ X2009       : chr  "  8 398" "   673" "   828" "  3 120" ...
##  $ X2010       : chr  "  8 676" "   690" "   840" "  3 170" ...
##  $ X2011       : chr  "  8 958" "   707" "   854" "  3 214" ...
##  $ X2012       : chr  "  9 246" "   724" "   868" "  3 250" ...
##  $ X2013       : chr  "  9 540" "   742" "   883" "  3 281" ...
##  $ X2014       : chr  "  9 844" "   759" "   899" "  3 311" ...
##  $ X2015       : chr  "  10 160" "   777" "   914" "  3 343" ...
##  $ X2016       : chr  "  10 488" "   796" "   929" "  3 377" ...
##  $ X2017       : chr  "  10 827" "   814" "   944" "  3 413" ...
##  $ X2018       : chr  "  11 175" "   832" "   959" "  3 453" ...
##  $ X2019       : chr  "  11 531" "   851" "   974" "  3 497" ...
##  $ X2020       : chr  "  11 891" "   870" "   988" "  3 546" ...
str(world)
## tibble [177 × 11] (S3: sf/tbl_df/tbl/data.frame)
##  $ iso_a2   : chr [1:177] "FJ" "TZ" "EH" "CA" ...
##  $ name_long: chr [1:177] "Fiji" "Tanzania" "Western Sahara" "Canada" ...
##  $ continent: chr [1:177] "Oceania" "Africa" "Africa" "North America" ...
##  $ region_un: chr [1:177] "Oceania" "Africa" "Africa" "Americas" ...
##  $ subregion: chr [1:177] "Melanesia" "Eastern Africa" "Northern Africa" "Northern America" ...
##  $ type     : chr [1:177] "Sovereign country" "Sovereign country" "Indeterminate" "Sovereign country" ...
##  $ area_km2 : num [1:177] 19290 932746 96271 10036043 9510744 ...
##  $ pop      : num [1:177] 885806 52234869 NA 35535348 318622525 ...
##  $ lifeExp  : num [1:177] 70 64.2 NA 82 78.8 ...
##  $ gdpPercap: num [1:177] 8222 2402 NA 43079 51922 ...
##  $ geom     :sfc_MULTIPOLYGON of length 177; first list element: List of 3
##   ..$ :List of 1
##   .. ..$ : num [1:5, 1:2] -180 -180 -180 -180 -180 ...
##   ..$ :List of 1
##   .. ..$ : num [1:9, 1:2] 178 178 177 177 178 ...
##   ..$ :List of 1
##   .. ..$ : num [1:8, 1:2] 180 180 179 179 179 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geom"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:10] "iso_a2" "name_long" "continent" "region_un" ...

Clean Data

First thing that we need to drop unnecessary columns and set datatypes (factor, num, etc.).

Clean air_pollution_df:

# Remove columns
air_pollution_df_cleaned <- subset(air_pollution_df, select = -c(Indoor.air.pollution..deaths.per.100.000., Outdoor.particulate.matter..deaths.per.100.000., Outdoor.ozone.pollution..deaths.per.100.000.))

# Set datatypes
air_pollution_df_cleaned$Entity = factor(air_pollution_df_cleaned$Entity)
air_pollution_df_cleaned$Code = factor(air_pollution_df_cleaned$Code)
air_pollution_df_cleaned$Year = factor(air_pollution_df_cleaned$Year)

# Rename columns
air_pollution_df_cleaned <- rename(air_pollution_df_cleaned, Country = Entity, ISO.alpha3.code = Code, Deaths.Air.Pollution.per.100k = Air.pollution..total...deaths.per.100.000.)

str(air_pollution_df_cleaned)
## 'data.frame':    6468 obs. of  4 variables:
##  $ Country                      : Factor w/ 231 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ISO.alpha3.code              : Factor w/ 197 levels "","AFG","AGO",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year                         : Factor w/ 28 levels "1990","1991",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Deaths.Air.Pollution.per.100k: num  299 291 279 279 287 ...

Clean gdp_df:

# Remove columns
gdp_df_cleaned <- subset(gdp_df, select = -c(Indicator.Name, Indicator.Code))

# Pivot wide to long dataset (data melt)
gdp_df_cleaned <- gdp_df_cleaned %>%
  pivot_longer(
    cols = starts_with("X"), 
    names_to = "Year", 
    values_to = "GDP.USD", 
    values_drop_na = TRUE
  )

# Remove characters from column
gdp_df_cleaned$Year<-gsub("X","",as.character(gdp_df_cleaned$Year))

# Set datatypes
gdp_df_cleaned$Country.Name = factor(gdp_df_cleaned$Country.Name)
gdp_df_cleaned$Country.Code = factor(gdp_df_cleaned$Country.Code)
gdp_df_cleaned$Year = factor(gdp_df_cleaned$Year)

# Rename columns
gdp_df_cleaned <- rename(gdp_df_cleaned, Country = Country.Name, ISO.alpha3.code = Country.Code)

str(gdp_df_cleaned)
## tibble [12,401 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Country        : Factor w/ 259 levels "Afghanistan",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ ISO.alpha3.code: Factor w/ 259 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year           : Factor w/ 60 levels "1960","1961",..: 27 28 29 30 31 32 33 34 35 36 ...
##  $ GDP.USD        : num [1:12401] 405463417 487602458 596423607 695304363 764887117 ...

Clean population_region_df:

# Remove columns
population_region_df_cleaned <- subset(population_region_df, select = -c(Notes, Type, Parent.code))

# Pivot wide to long dataset (data melt)
population_region_df_cleaned <- population_region_df_cleaned %>%
  pivot_longer(
    cols = starts_with("X"), 
    names_to = "Year", 
    values_to = "Population.thousands", 
    values_drop_na = TRUE
  )

# Remove characters from column
population_region_df_cleaned$Year<-gsub("X","",as.character(population_region_df_cleaned$Year))
population_region_df_cleaned <- as.data.frame(apply(population_region_df_cleaned, 2, function(x) gsub("\\s+", "", x))) 

# Set datatypes
population_region_df_cleaned$Country = factor(population_region_df_cleaned$Country)
population_region_df_cleaned$SDGRegion = factor(population_region_df_cleaned$SDGRegion)
population_region_df_cleaned$Country.code = factor(population_region_df_cleaned$Country.code)
population_region_df_cleaned$SubRegion = factor(population_region_df_cleaned$SubRegion)
population_region_df_cleaned$Year = factor(population_region_df_cleaned$Year)
population_region_df_cleaned$Population.thousands = as.numeric(population_region_df_cleaned$Population.thousands)

# Rename columns
population_region_df_cleaned <- rename(population_region_df_cleaned, M49.code = Country.code)

str(population_region_df_cleaned)
## 'data.frame':    16685 obs. of  6 variables:
##  $ SDGRegion           : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SubRegion           : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Country             : Factor w/ 235 levels "Afghanistan",..: 34 34 34 34 34 34 34 34 34 34 ...
##  $ M49.code            : Factor w/ 235 levels "100","104","108",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Year                : Factor w/ 71 levels "1950","1951",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Population.thousands: num  2309 2360 2406 2449 2492 ...

Clean population_region_df:

# Set datatypes
country_codes_df$M49.code = factor(country_codes_df$M49.code)
country_codes_df$Country.or.Area = factor(country_codes_df$Country.or.Area)
country_codes_df$ISO.alpha3.code = factor(country_codes_df$ISO.alpha3.code)
country_codes_df$ISO.alpha2.code = factor(country_codes_df$ISO.alpha2.code)

str(country_codes_df)
## 'data.frame':    249 obs. of  4 variables:
##  $ Country.or.Area: Factor w/ 249 levels "Afghanistan",..: 6 234 1 10 8 3 12 7 9 11 ...
##  $ ISO.alpha2.code: Factor w/ 248 levels "AD","AE","AF",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ISO.alpha3.code: Factor w/ 249 levels "ABW","AFG","AGO",..: 7 8 2 14 4 6 10 3 12 9 ...
##  $ M49.code       : Factor w/ 249 levels "4","8","10","12",..: 6 225 1 8 190 2 16 7 3 10 ...

Clean world:

# Set datatypes
world$iso_a2 = factor(world$iso_a2)

# Remove columns
world_df_cleaned <- subset(world, select = c(iso_a2, geom))

str(world_df_cleaned)
## tibble [177 × 2] (S3: sf/tbl_df/tbl/data.frame)
##  $ iso_a2: Factor w/ 175 levels "AE","AF","AL",..: 53 162 48 26 165 89 167 124 72 7 ...
##  $ geom  :sfc_MULTIPOLYGON of length 177; first list element: List of 3
##   ..$ :List of 1
##   .. ..$ : num [1:5, 1:2] -180 -180 -180 -180 -180 ...
##   ..$ :List of 1
##   .. ..$ : num [1:9, 1:2] 178 178 177 177 178 ...
##   ..$ :List of 1
##   .. ..$ : num [1:8, 1:2] 180 180 179 179 179 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geom"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
##   ..- attr(*, "names")= chr "iso_a2"

Note that we only have geometries for 175 countries, some will not be able to be plot on a map but that is okay.

# clean Air Quality Dataset for merge
airquality <- read.csv("/Users/b/Desktop/DataScience/DataScience/GitHub_Repo/Midterm Project/data/airquality.csv")
str(airquality)
## 'data.frame':    4872 obs. of  7 variables:
##  $ LOCATION : chr  "AUS" "AUS" "AUS" "AUS" ...
##  $ INDICATOR: chr  "POLLUTIONEXP" "POLLUTIONEXP" "POLLUTIONEXP" "POLLUTIONEXP" ...
##  $ SUBJECT  : chr  "EXPOS2PM25" "EXPOS2PM25" "EXPOS2PM25" "EXPOS2PM25" ...
##  $ MEASURE  : chr  "MICGRCUBM" "MICGRCUBM" "MICGRCUBM" "MICGRCUBM" ...
##  $ FREQUENCY: chr  "A" "A" "A" "A" ...
##  $ Year     : int  1990 1995 2000 2005 2010 2011 2012 2013 2014 2015 ...
##  $ AirQ     : num  7.6 7.5 7.37 6.91 6.79 ...
airquality$ISO.alpha3.code<-airquality$LOCATION
airquality$Pollution.Exposure<-airquality$SUBJECT

airquality <- subset(airquality, select = c(ISO.alpha3.code, Year, AirQ,Pollution.Exposure))
str(airquality)
## 'data.frame':    4872 obs. of  4 variables:
##  $ ISO.alpha3.code   : chr  "AUS" "AUS" "AUS" "AUS" ...
##  $ Year              : int  1990 1995 2000 2005 2010 2011 2012 2013 2014 2015 ...
##  $ AirQ              : num  7.6 7.5 7.37 6.91 6.79 ...
##  $ Pollution.Exposure: chr  "EXPOS2PM25" "EXPOS2PM25" "EXPOS2PM25" "EXPOS2PM25" ...

Final DataFrame Construction

Now let’s merge our 4 datasets into one using a series of inner joins using country code and year as keys depending on the specific join. We are using inner joins because we want to drop all null values which would mean either a country does not have a country code or we have more years of data than our smallest year range (the air pollution dataset).

# Join datasets
final_df <- merge(x = air_pollution_df_cleaned, y = country_codes_df, by = 'ISO.alpha3.code')
final_df <- merge(x = final_df, y = gdp_df_cleaned, by = c('ISO.alpha3.code', 'Year'))
final_df <- merge(x = final_df, y = population_region_df_cleaned, by = c('M49.code', 'Year'))
final_df <- merge(x = final_df, y = world_df_cleaned, by.x = 'ISO.alpha2.code', by.y = 'iso_a2', all.x = TRUE) #left join
final_df <- merge(x= final_df, y= airquality, by=c("Year","ISO.alpha3.code")) 


# Remove columns
final_df <- subset(final_df, select = -c(Country.or.Area, Country.y, Country))

# Calculate GDP per capita 
final_df$gdp.per.capita <- final_df$GDP.USD / final_df$Population.thousands

str(final_df)
## 'data.frame':    3782 obs. of  14 variables:
##  $ Year                         : Factor w/ 28 levels "1990","1991",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ISO.alpha3.code              : Factor w/ 197 levels "","AFG","AGO",..: 3 3 4 4 6 6 7 7 8 8 ...
##  $ ISO.alpha2.code              : Factor w/ 248 levels "AD","AE","AF",..: 8 8 6 6 2 2 10 10 7 7 ...
##  $ M49.code                     : Factor w/ 249 levels "4","8","10","12",..: 7 7 2 2 225 225 10 10 16 16 ...
##  $ Country.x                    : Factor w/ 231 levels "Afghanistan",..: 7 7 2 2 216 216 9 9 10 10 ...
##  $ Deaths.Air.Pollution.per.100k: num  224.5 224.5 82.8 82.8 87.8 ...
##  $ GDP.USD                      : num  11228764963 11228764963 2028553750 2028553750 50701443748 ...
##  $ SDGRegion                    : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 9 9 4 4 6 6 5 5 6 6 ...
##  $ SubRegion                    : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 10 10 19 19 21 21 16 16 21 21 ...
##  $ Population.thousands         : num  11848 11848 3286 3286 1828 ...
##  $ geom                         :sfc_MULTIPOLYGON of length 3782; first list element: List of 2
##   ..$ :List of 1
##   .. ..$ : num [1:66, 1:2] 12.3 12.2 12.7 12.9 13.2 ...
##   ..$ :List of 1
##   .. ..$ : num [1:9, 1:2] 13 12.6 12.3 11.9 12.2 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  $ AirQ                         : num  29.9 100 100 24.5 42.3 ...
##  $ Pollution.Exposure           : chr  "EXPOS2PM25" "EXPOS2PM25" "EXPOS2PM25" "EXPOS2PM25" ...
##  $ gdp.per.capita               : num  947735 947735 617332 617332 27736020 ...

Our dataset is finally ready to be analyzed.

4. EDA - Exploratory Data Analysis

Quick Plots

Let’s start our EDA process by just looking at some quick plots to look at the distribution of data.

Histogram of Air Pollution Induced Deaths, Population, and GDP per Capita

#Deaths
death_hist <- ggplot() + 
      geom_histogram(data = final_df, aes(x = Deaths.Air.Pollution.per.100k), bins=30, color="darkblue", fill="steelblue", binwidth = 5)  +
              labs(title = "Distribution of Deaths per 100,000 from Air Pollution",
                          subtitle = "X Countries, 1990 - 2017", 
                          caption = "Data Source: Kaggle",
                          y = "Count",
                          x = "Deaths per 100,000") +
        theme_minimal() +
        theme(axis.title = element_text(size = 8, face = "bold"),
              panel.grid.major.x = element_blank(),
              panel.grid.minor = element_blank(),
              panel.background = element_blank(),
              panel.grid.major.y = element_blank(),
              axis.line.x = element_line(color = "black"),
              axis.ticks = element_line(color = "black"),
              axis.text = element_text(size = 10),
              legend.position = "none",
              plot.subtitle = element_text(size = 8),
              plot.title = element_text(size = 10, margin = margin(b = 10)))

death_hist <- death_hist + theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0),
                           plot.subtitle = element_text(color = "black", size = 10, hjust = 0 ),
                           plot.caption = element_text(color = "black", size =8, face = "italic", hjust =0))

death_hist

##Population
pop_hist <- ggplot() + geom_histogram(data = final_df, aes(x = Population.thousands), bins=30, color="darkgreen", fill="green4")  +
              labs(title = "Population Distribution",
                          subtitle = "X Countries, 1990 - 2017", 
                          caption = "Data Source: United Nations",
                          y = "Count",
                          x = "Population") +
              theme_minimal() +
              theme(axis.title = element_text(size = 8, face = "bold"),
                    panel.grid.major.x = element_blank(),
                    panel.grid.minor = element_blank(),
                    panel.background = element_blank(),
                    panel.grid.major.y = element_blank(),
                    axis.line.x = element_line(color = "black"),
                    axis.ticks = element_line(color = "black"),
                    axis.text = element_text(size = 10),
                    legend.position = "none",
                    plot.subtitle = element_text(size = 8),
                    plot.title = element_text(size = 10, margin = margin(b = 10)))

pop_hist <- pop_hist + theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0),
                           plot.subtitle = element_text(color = "black", size = 10, hjust = 0 ),
                           plot.caption = element_text(color = "black", size =8, face = "italic", hjust =0))

pop_hist

##GDP
gdp_hist <- ggplot() + geom_histogram(data = final_df, aes(x = gdp.per.capita), bins=30, color="orangered4", fill="orangered3")  +
              labs(title = "Distribution of GDP per Capita",
                          subtitle = "X Countries, 1990 - 2017", 
                          caption = "Data Source: United Nations",
                          y = "Count",
                          x = "GDP per Capita") +
              theme_minimal() +
              theme(axis.title = element_text(size = 8, face = "bold"),
                    panel.grid.major.x = element_blank(),
                    panel.grid.minor = element_blank(),
                    panel.background = element_blank(),
                    panel.grid.major.y = element_blank(),
                    axis.line.x = element_line(color = "black"),
                    axis.ticks = element_line(color = "black"),
                    axis.text = element_text(size = 10),
                    legend.position = "none",
                    plot.subtitle = element_text(size = 8),
                    plot.title = element_text(size = 10, margin = margin(b = 10)))

gdp_hist <- gdp_hist + theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0),
                           plot.subtitle = element_text(color = "black", size = 10, hjust = 0 ),
                           plot.caption = element_text(color = "black", size =8, face = "italic", hjust =0))

gdp_hist
Figure 5,6,7: Histogram of Air Pollution Induced Deaths, Population, and GDP per Capita.

Looks like deaths.air.pollution.per.100k, population, and gdp.per.capita are not normal and are all skewed to the right, indicating that the mean is greater than the median.

Boxplot of Air Pollution Induced Deaths, Population, and GDP per Capita

Let’s look at a boxplot for the outliers.

death_boxplot <- ggplot(final_df, aes(x=SDGRegion, y = Deaths.Air.Pollution.per.100k)) + 
                  geom_boxplot(fill="skyblue")
death_boxplot <- death_boxplot + labs(title="BoxPlot of Deaths per 100,000 from Air Pollution, by SDG Region", x = "SDG Region", y ="Deaths per 100k from Air Polution") + 
  theme_classic() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 5))
death_boxplot
Figure 8: Boxplot of Deaths per 100,000 from Air Pollution vs SDG Region

Interesting to note that Australia/New Zealand, Europe, North America seem to have the lowest deaths per 100k from air pollution and are all fairly compactly packed together (low variance) relative to other regions around the world. Furthermore, these region contain the most advanced countries.

Let’s take another look but at SubRegions.

sub_boxplot <- ggplot(final_df, aes(x=SubRegion, y = Deaths.Air.Pollution.per.100k)) + 
                  geom_boxplot(fill="azure3")
sub_boxplot <- sub_boxplot + labs(title="BoxPlot of Deaths per 100k from Air Pollution, by Sub Regions", x = "Sub Regions", y ="Deaths per 100k from Air Polution") + 
  theme_classic() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4))
sub_boxplot
Figure 9: Boxplot of Deaths per 100k from Air Pollution vs Sub Region

Separating out into an even granular grouping of regions show some trends where Australia/New Zealand, North America, Northern Europe, and Western Europe all have low deaths per 100k and have low variance. Historically, these regions consist of countries that have been considered ‘First World’ before our first year of analysis of 1990. We will dig into this more later in our SMART questions.

What does the GDP per capita of these regions look like comparatively? Let’s take a look.

gdp_boxplot <- ggplot(final_df, aes(x=SubRegion, y = gdp.per.capita)) + 
                  geom_boxplot(fill="aquamarine3")
gdp_boxplot <- gdp_boxplot + labs(title="BoxPlot of GDP per Capita, by Sub Regions", x = "Sub Regions", y ="GDP per Capita (USD per person)") + 
  theme_classic() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4))
gdp_boxplot
Figure 10: Boxplot of GDP per Capita vs Sub Region

Interesting to observe that the same subregions that have low deaths caused by air pollution also have high GDP per capita comparatively. We will try to see if we can quantify this relationship later on in our main research analysis.

Map of Countries

Plotting maps and maps with intensities will be useful for us to visualize our data and the results of our analysis.

# Convert to sf object so we can plot it
final_df_sf = st_as_sf(final_df)

# Choose relevant columns
final_df_sf_regions <- subset(final_df_sf, select = c(SDGRegion, SubRegion))
final_df_sf_intensities <- subset(final_df_sf, select = c(gdp.per.capita, Deaths.Air.Pollution.per.100k, Population.thousands))
plot(final_df_sf_regions)
Figure 11: Global Map of SDGRegions and SubRegions
plot(final_df_sf_intensities)
Figure 12: Global Intensity Map of Key Numerical Features, 1990 to 2017

Looks like some inverse correlation between gdp.per.capita and deaths.air.pollution.per.100k.

We can also use ggplot2 to have a bit more control over map plotting.

g1 = ggplot() + geom_sf(data = final_df_sf, aes(fill = Deaths.Air.Pollution.per.100k), color = 'black') + scale_fill_distiller(palette = "YlGnBu") + coord_sf(crs = st_crs(4283)) + theme_minimal()

g1
Figure 13: Global Intensity Map of Deaths due to Air Pollution per 100k People, 1990 to 2017
g2 = ggplot() + geom_sf(data = final_df_sf[final_df_sf$SDGRegion == 'EASTERNANDSOUTH-EASTERNASIA' & final_df_sf$Year == 2017, ], aes(fill = Deaths.Air.Pollution.per.100k), color = 'black') + scale_fill_distiller(palette = "YlGnBu") + coord_sf(crs = st_crs(4283)) + theme_minimal()

g2
Figure 14: Intensity Map of Deaths due to Air Pollution per 100k People in East and Southeastern Asia, 2017

SMART Questions

1. Is there a relationship between population size and Air Quality (Exposure to PM2.5 fine particles)?

-Define PM2.5 - What levels are considered low, moderate, and high? 0-50 | 51-100 |101-150 - source?

Below, we would like to measure the relationship between Population size (in thousands) and Air Quality (average population exposure to more than 10 micrograms/m3). since these variables are numerical, we then confirm the normal distribution of both variables

If the p-value is < .05, then the correlation between population size and air pollution is significant.

str(final_df)
## 'data.frame':    3782 obs. of  14 variables:
##  $ Year                         : Factor w/ 28 levels "1990","1991",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ISO.alpha3.code              : Factor w/ 197 levels "","AFG","AGO",..: 3 3 4 4 6 6 7 7 8 8 ...
##  $ ISO.alpha2.code              : Factor w/ 248 levels "AD","AE","AF",..: 8 8 6 6 2 2 10 10 7 7 ...
##  $ M49.code                     : Factor w/ 249 levels "4","8","10","12",..: 7 7 2 2 225 225 10 10 16 16 ...
##  $ Country.x                    : Factor w/ 231 levels "Afghanistan",..: 7 7 2 2 216 216 9 9 10 10 ...
##  $ Deaths.Air.Pollution.per.100k: num  224.5 224.5 82.8 82.8 87.8 ...
##  $ GDP.USD                      : num  11228764963 11228764963 2028553750 2028553750 50701443748 ...
##  $ SDGRegion                    : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 9 9 4 4 6 6 5 5 6 6 ...
##  $ SubRegion                    : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 10 10 19 19 21 21 16 16 21 21 ...
##  $ Population.thousands         : num  11848 11848 3286 3286 1828 ...
##  $ geom                         :sfc_MULTIPOLYGON of length 3782; first list element: List of 2
##   ..$ :List of 1
##   .. ..$ : num [1:66, 1:2] 12.3 12.2 12.7 12.9 13.2 ...
##   ..$ :List of 1
##   .. ..$ : num [1:9, 1:2] 13 12.6 12.3 11.9 12.2 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  $ AirQ                         : num  29.9 100 100 24.5 42.3 ...
##  $ Pollution.Exposure           : chr  "EXPOS2PM25" "EXPOS2PM25" "EXPOS2PM25" "EXPOS2PM25" ...
##  $ gdp.per.capita               : num  947735 947735 617332 617332 27736020 ...
qqnorm(final_df$Population.thousands)

qqnorm(final_df$AirQ)

cor(final_df$Population.thousands, final_df$AirQ, method = c("pearson"))
## [1] 0.0653
cor.test(final_df$Population.thousands, final_df$AirQ, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  final_df$Population.thousands and final_df$AirQ
## t = 4, df = 3780, p-value = 0.00006
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0335 0.0969
## sample estimates:
##    cor 
## 0.0653
loadPkg("ggpubr")
ggscatter(final_df, x = "Population.thousands", y = "AirQ", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Population in thousands", ylab = "Pollution Exposure")

# Plot and Regression line
plot(final_df$Population.thousands, final_df$AirQ, pch = 19, col = "navyblue")
abline(lm(final_df$AirQ ~ final_df$Population.thousands), col = "red", lwd = 3)
text(paste("Correlation:", round(cor(final_df$Population.thousands, final_df$AirQ), 2)), x = 25, y = 95)

# Pearson correlation
pop_poll_cor<- cor(select(final_df, AirQ, Population.thousands, gdp.per.capita,))
pop_poll_cor
##                         AirQ Population.thousands gdp.per.capita
## AirQ                  1.0000               0.0653        -0.2570
## Population.thousands  0.0653               1.0000        -0.0449
## gdp.per.capita       -0.2570              -0.0449         1.0000
#correlation matrix
xkabledply(pop_poll_cor)
Table
AirQ Population.thousands gdp.per.capita
AirQ 1.0000 0.0653 -0.2570
Population.thousands 0.0653 1.0000 -0.0449
gdp.per.capita -0.2570 -0.0449 1.0000
#plot
loadPkg("corrplot")
corrplot(pop_poll_cor)

3. Which regions have the lowest and highest pollution?

library(scales)
region <- ggplot(final_df[which(final_df$Year ==2017),], aes(y= reorder(SDGRegion,AirQ), x= AirQ, fill = "dodgerblue4")) +
  geom_bar(stat ="identity", size = 2.5, width=0.5, position = position_dodge(width=0.5)) +
  scale_fill_manual(values=c("dodgerblue4")) +
  scale_x_continuous(label=comma, limits=c(0,150), breaks = seq(0,150,25))

region <- region + labs(title = "Pollution Exposure, by Region",
                                          subtitle = "Average Exposure to PM2.5 fine particles, 2017", 
                                          caption = "Air quality and health: Exposure to PM2.5 fine particles (OECD)",
                                          y = "",
                                          x = "Pollution Exposure") +
  theme_minimal() +
  theme(axis.title = element_text(size = 8, face = "bold"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        panel.grid.major.y = element_blank(),
        axis.line.x = element_line(color = "black"),
        axis.ticks = element_line(color = "black"),
        axis.text = element_text(size = 10),
        legend.position = "none",
        plot.subtitle = element_text(size = 8),
        plot.title = element_text(size = 10, margin = margin(b = 10))) +
  scale_color_manual(values = c("darkblue"))


region <- region + theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0),
                                           plot.subtitle = element_text(color = "black", size = 10, hjust = 0 ),
                                           plot.caption = element_text(color = "black", size =8, face = "italic", hjust =0))
region

4. How does air pollution increase over time? More specifically, are death rates in recent X amount of years higher than death rates from groups of X years before?

library(scales)
#Aggregate data by total deaths by year
deathsbyyear <- group_by(.data = final_df,Year)
totdeath <- summarize(.data = deathsbyyear, tot_deaths = sum(Deaths.Air.Pollution.per.100k, na.rm = TRUE))
str(totdeath)
## tibble [12 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Year      : Factor w/ 28 levels "1990","1991",..: 1 6 11 16 21 22 23 24 25 26 ...
##  $ tot_deaths: num [1:12] 31207 31769 30317 28326 25190 ...
deaths_line <- ggplot() +
               geom_line(data = totdeath, mapping = aes(x = Year, y = tot_deaths, group = 1), size = 1.2) +
              geom_point(data = totdeath, mapping = aes(x = Year, y = tot_deaths, group = 1), size = 1.2) +
              scale_color_manual(values = c("darkmagenta")) +
              scale_y_continuous(label = comma, limits = c(0, 40000), breaks = seq(0,40000,10000)) 


deaths_line <- deaths_line + labs(title = "Deaths per 1000,000 by Air Pollution",
                                            subtitle = "1990 - 2017", 
                                            caption = "?",
                                            y = "air Pollution Induced Deaths",
                                            x = "") +
              theme_minimal() +
              theme(axis.title = element_text(size = 8, face = "bold"),
                    panel.grid.major.x = element_blank(),
                    panel.grid.minor = element_blank(),
                    panel.background = element_blank(),
                    panel.grid.major.y = element_blank(),
                    axis.line.x = element_line(color = "black"),
                    axis.ticks = element_line(color = "black"),
                    axis.text = element_text(size = 10),
                    legend.position = "none",
                    plot.subtitle = element_text(size = 8),
                    plot.title = element_text(size = 10, margin = margin(b = 10)))
            
            deaths_line <- deaths_line + theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0),
                                         plot.subtitle = element_text(color = "black", size = 10, hjust = 0 ),
                                         plot.caption = element_text(color = "black", size =8, face = "italic", hjust =0))
            deaths_line

5. Main Research Question

Do lower GDP countries have more deaths per 100k due to air pollution?

Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?

Let’s first look at the general fit on the overall data.

fit1 <- lm(Deaths.Air.Pollution.per.100k ~ gdp.per.capita, data=final_df_sf)

ggplot(final_df_sf, aes(gdp.per.capita, Deaths.Air.Pollution.per.100k)) +
  geom_point() +
  geom_smooth(method='lm', se=FALSE) +
  stat_regline_equation(label.y = 400, aes(label = ..eq.label..)) +
  stat_regline_equation(label.y = 350, aes(label = ..rr.label..))
Fig XX: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita.

From the plot, we observe that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, the strength of that relationship is not particularly strong as the R2 is really low at 0.313. This means that only 29% of the variance experienced in deaths due to air pollution per 100k is caused by GDP per capita in a linear relationship.

final_df_sf_most_recent_5_yrs <- subset(final_df_sf, Year == c(2017, 2016, 2015, 2014, 2013))

fit2 <- lm(Deaths.Air.Pollution.per.100k ~ gdp.per.capita, data=final_df_sf_most_recent_5_yrs)

summary(fit2)
print(vif(fit2))

# plot(fit1)

ggplot(final_df_sf_most_recent_5_yrs, aes(gdp.per.capita, Deaths.Air.Pollution.per.100k)) +
  geom_point() +
  geom_smooth(method='lm')

final_df_sf_asia <- subset(final_df_sf, SDGRegion == 'EASTERNANDSOUTH-EASTERNASIA')

fit3 <- lm(Deaths.Air.Pollution.per.100k ~ log(gdp.per.capita), data=final_df_sf_asia)

summary(fit3)
print(vif(fit3))

# plot(fit1)

ggplot(final_df_sf, aes(log(gdp.per.capita), Deaths.Air.Pollution.per.100k)) +
  geom_point() +
  geom_smooth(method='lm') + 
  facet_wrap(~SDGRegion, ncol = 4)

Is there a difference in means of death caused by pollution between low, mid, and high GDP per capita?

6. Conclusion

7. Bibliography

Fig X: References
Name Link
Making maps with R Link
Geographic data in R Link
Maps in ggplot2 Link
ggplot2 color scales Link